288 research outputs found

    SSMART: Sequence-structure motif identification for RNA-binding proteins

    MOTIVATION: RNA-binding proteins (RBPs) regulate every aspect of RNA metabolism and function. There are hundreds of RBPs encoded in eukaryotic genomes, and each recognizes its RNA targets through a specific mixture of RNA sequence and structure properties. For most RBPs, however, only a primary sequence motif has been determined, while the structure of the binding sites is uncharacterized. RESULTS: We developed SSMART, an RNA motif finder that simultaneously models the primary sequence and the structural properties of the RNA target sites. The sequence-structure motifs are represented as consensus strings over a degenerate alphabet, extending the IUPAC codes for nucleotides to account for secondary structure preferences. Evaluation on synthetic data showed that SSMART is able to recover both sequence and structure motifs implanted into 3'UTR-like sequences, for various degrees of structured/unstructured binding sites. In addition, we successfully used SSMART on high-throughput in vivo and in vitro data, showing that we not only recover the known sequence motif, but also gain insight into the structural preferences of the RBP. AVAILABILITY: SSMART is freely available at https://ohlerlab.mdc-berlin.de/software/SSMART
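
    The idea of a consensus string over a degenerate alphabet can be illustrated with a minimal sketch. It covers only the sequence half, using the standard IUPAC nucleotide codes; the structure-aware extension of the alphabet is not specified in the abstract and is not reproduced here. The function names `consensus_to_regex` and `find_matches` are hypothetical and are not part of SSMART.

```python
import re

# Standard IUPAC nucleotide codes (RNA alphabet) and the bases each one allows.
# Assumption: sequence-only codes; SSMART's structure-extended alphabet is not shown.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "AG", "Y": "CU", "S": "CG", "W": "AU",
    "K": "GU", "M": "AC", "B": "CGU", "D": "AGU",
    "H": "ACU", "V": "ACG", "N": "ACGU",
}


def consensus_to_regex(consensus):
    """Translate a degenerate consensus string into a compiled regex.

    The pattern is wrapped in a lookahead so overlapping sites are reported.
    """
    body = "".join("[" + IUPAC[code] + "]" for code in consensus.upper())
    return re.compile("(?=(" + body + "))")


def find_matches(consensus, sequence):
    """Return 0-based start positions where the consensus matches the sequence."""
    pattern = consensus_to_regex(consensus)
    return [m.start() for m in pattern.finditer(sequence.upper())]


if __name__ == "__main__":
    # Hypothetical example motif and sequence, chosen only for illustration.
    print(find_matches("UGUANAUA", "AAUGUAAAUACC"))
```

    A motif finder would score many such candidate strings against bound and unbound sequences; this sketch only shows how one degenerate string is matched.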

    SaTAnn quantifies translation on the functionally heterogeneous transcriptome

    Deep sequencing methods have matured to comprehensively detect the full set of transcribed loci, but a gap remains in determining the function of the resulting highly complex transcriptomes. At the center of the gene expression cascade, translation is fundamental in defining the fate of much of the transcribed genome. We have developed a new approach (SaTAnn, Splice-aware Translatome Annotation) to annotate and quantify translation at the single open reading frame (ORF) level that uses information from ribosome profiling to determine the translational state of each isoform in a comprehensive annotation. For most genes, one ORF represents the dominant translation product, but our approach also detects translation from ORFs belonging to multiple transcripts per gene, including targets of RNA surveillance mechanisms such as nonsense-mediated decay. Diversity in the translation output across human cell lines reveals the extent of gene-specific differences in protein production, which are supported by steady-state protein abundance estimates. Computational analysis of Ribo-seq data with SaTAnn (available at https://github.com/lcalviell/SaTAnn) provides a window into the functions of the heterogeneous transcriptome

    Finding RNA structure in the unstructured RBPome

    BACKGROUND: RNA-binding proteins (RBPs) play vital roles in many processes in the cell. Different RBPs bind RNA with different sequence and structure specificities. While sequence specificities for a large set of 205 RBPs have been reported through the RNAcompete compendium, structure specificities are known for only a small fraction. The main limitation lies in the design of the RNAcompete technology, which tests RBP binding against unstructured RNA probes, making it difficult to infer structural preferences from these data. We recently developed RCK, an algorithm to infer sequence and structural binding models from RNAcompete data. The set of binding models enables, for the first time, a large-scale assessment of RNA structure in the RBPome. RESULTS: We re-validate and uncover the role of RNA structure in the RBPome through novel analysis of the largest-scale dataset to date. First, we show that RNA structure exists in presumably unstructured RNA probes and that its variability is correlated with RNA-binding. Second, we examine the structural binding preferences of RBPs and discover an overall preference to bind RNA loops. Third, we significantly improve protein-binding prediction using RNA structure, both in vitro and in vivo. Lastly, we demonstrate that RNA structural binding preferences can be inferred for new proteins solely from their amino acid content. CONCLUSIONS: By counter-intuitively demonstrating through our analysis that we can predict both the RNA structure of and RBP binding to these putatively unstructured RNAs, we transform a compendium of RNA-binding proteins into a valuable resource for structure-based binding models. We uncover the important role RNA structure plays in protein-RNA interaction for hundreds of RNA-binding proteins

    Sustained-input switches for transcription factors and microRNAs are central building blocks of eukaryotic gene circuits

    WaRSwap is a randomization algorithm that for the first time provides a practical network motif discovery method for large multi-layer networks, for example those that include transcription factors, microRNAs, and non-regulatory protein coding genes. The algorithm is applicable to systems with tens of thousands of genes, while accounting for critical aspects of biological networks, including self-loops, large hubs, and target rearrangements. We validate WaRSwap on a newly inferred regulatory network from Arabidopsis thaliana, and compare outcomes on published Drosophila and human networks. Specifically, sustained input switches are among the few over-represented circuits across this diverse set of eukaryotes
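
    Network motif discovery compares subgraph counts in the observed network against counts in randomized networks. As a rough illustration of that randomization step, the sketch below uses a generic degree-preserving edge swap on a directed edge list. This is explicitly not the WaRSwap algorithm, whose sampling scheme is not described in the abstract; the function name `degree_preserving_shuffle` and the example node names are hypothetical.

```python
import random


def degree_preserving_shuffle(edges, n_attempts, seed=0):
    """Randomize a directed network while keeping every node's in- and out-degree.

    Generic edge-swap null model (an assumption for illustration, not WaRSwap):
    pick two edges (a -> b) and (c -> d) and rewire them to (a -> d) and (c -> b).
    Assumes the input edge list has at least two edges and no duplicates.
    """
    rng = random.Random(seed)
    edges = list(edges)
    present = set(edges)
    for _ in range(n_attempts):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        new_1, new_2 = (a, d), (c, b)
        # Reject swaps that would create a new self-loop.
        if a == d or c == b:
            continue
        # Reject swaps that would duplicate an existing edge.
        if new_1 in present or new_2 in present:
            continue
        present.difference_update([edges[i], edges[j]])
        present.update([new_1, new_2])
        edges[i], edges[j] = new_1, new_2
    return edges


if __name__ == "__main__":
    # Hypothetical toy network: transcription factors, a microRNA, and target genes.
    toy = [("tf1", "g1"), ("tf1", "g2"), ("tf2", "g2"), ("tf2", "g3"),
           ("mir1", "tf1"), ("mir1", "g3"), ("tf3", "g1")]
    print(degree_preserving_shuffle(toy, n_attempts=200, seed=1))
```

    Counting a circuit of interest in many such shuffled networks gives a background distribution against which the observed count can be judged.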

    Cseq-simulator: a data simulator for CLIP-Seq experiments

    CLIP-Seq protocols such as PAR-CLIP, HITS-CLIP or iCLIP allow a genome-wide analysis of protein-RNA interactions. For the processing of the resulting short read data, various tools are utilized. Some of these tools were specifically developed for CLIP-Seq data, whereas others were designed for the analysis of RNA-Seq data. To date, however, it has not been assessed which of the available tools are most appropriate for the analysis of CLIP-Seq data. This is because an experimental gold-standard dataset on which methods can be assessed and compared is still not available. To address this lack of a gold-standard dataset, we here present Cseq-Simulator, a simulator for PAR-CLIP, HITS-CLIP and iCLIP data. This simulator can be applied to generate realistic datasets that can serve as surrogates for experimental gold-standard datasets. In this work, we also show how Cseq-Simulator can be used to perform a comparison of steps of typical CLIP-Seq analysis pipelines, such as the read alignment or the peak calling. These comparisons show which tools are useful in different settings and also allow identifying pitfalls in the data analysis

    COUGER-co-factors associated with uniquely-bound genomic regions

    Most transcription factors (TFs) belong to protein families that share a common DNA binding domain and have very similar DNA binding preferences. However, many paralogous TFs (i.e. members of the same TF family) perform different regulatory functions and interact with different genomic regions in the cell. A potential mechanism for achieving this differential in vivo specificity is through interactions with protein co-factors. Computational tools for studying the genomic binding profiles of paralogous TFs and identifying their putative co-factors are currently lacking. Here, we present an interactive web implementation of COUGER, a classification-based framework for identifying protein co-factors that might provide specificity to paralogous TFs. COUGER takes as input two sets of genomic regions bound by paralogous TFs, and it identifies a small set of putative co-factors that best distinguish the two sets of sequences. To achieve this task, COUGER uses a classification approach, with features that reflect the DNA-binding specificities of the putative co-factors. The identified co-factors are presented in a user-friendly output page, together with information that allows the user to understand and to explore the contributions of individual co-factor features. COUGER can be run as a stand-alone tool or through a web interface: http://couger.oit.duke.edu

    omniCLIP: probabilistic identification of protein-RNA interactions from CLIP-seq data

    CLIP-seq methods allow the generation of genome-wide maps of interaction sites between RNA-binding proteins and RNA. However, due to differences between CLIP-seq assays, existing computational approaches to analyze the data can only be applied to a subset of assays. Here, we present a probabilistic model called omniCLIP that can detect regulatory elements in RNAs from data of all CLIP-seq assays. omniCLIP jointly models data across replicates and can integrate background information. Therefore, omniCLIP greatly simplifies the data analysis, increases the reliability of results and paves the way for integrative studies based on data from different assays

    Automated annotation of gene expression image sequences via non-parametric factor analysis and conditional random fields

    Motivation: Computational approaches for the annotation of phenotypes from image data have shown promising results across many applications, and provide rich and valuable information for studying gene function and interactions. While data are often available both at high spatial resolution and across multiple time points, phenotypes are frequently annotated independently, for individual time points only. In particular, for the analysis of developmental gene expression patterns, it is biologically sensible when images across multiple time points are jointly accounted for, such that spatial and temporal dependencies are captured simultaneously. Methods: We describe a discriminative undirected graphical model to label gene-expression time-series image data, with an efficient training and decoding method based on the junction tree algorithm. The approach is based on an effective feature selection technique, consisting of a non-parametric sparse Bayesian factor analysis model. The result is a flexible framework, which can handle large-scale data with noisy incomplete samples, i.e. it can tolerate data missing from individual time points. Results: Using the annotation of gene expression patterns across stages of Drosophila embryonic development as an example, we demonstrate that our method achieves superior accuracy, gained by jointly annotating phenotype sequences, when compared with previous models that annotate each stage in isolation. The experimental results on missing data indicate that our joint learning method successfully annotates genes for which no expression data are available for one or more stages

    Ribo-seQC: comprehensive analysis of cytoplasmic and organellar ribosome profiling data

    Summary: Ribosome profiling enables genome-wide analysis of translation with unprecedented resolution. We present Ribo-seQC, a versatile tool for the comprehensive analysis of Ribo-seq data, providing in-depth insights on data quality and translational profiles for cytoplasmic and organelle ribosomes. Ribo-seQC automatically generates platform-independent HTML reports, offering a detailed and easy-to-share basis for collaborative Ribo-seq projects. Availability: Ribo-seQC is available at https://github.com/ohlerlab/RiboseQC and submitted to Bioconductor. Contact: uwe.ohler{at}mdc-berlin.d